Urban Mobility Analysis

Literature Review

Our primary source of insight and conceptual foundation was the OECD’s EPIC (Environmental Policies and Individual Behaviour Change) survey [1], conducted in three rounds between 2008 and 2022, with an average of 13,000 observations per round from at least 10 developed countries. This survey provided insights into household decision-making and transport behaviors, which guided the selection of our parameters and our segmentation approach.

Specifically, we followed EPIC’s structure and used socio-demographic variables such as car ownership, household income, and the presence of children as the foundation for our analysis. We also used the survey as our reference for segmenting by trip characteristics (e.g., origin-destination regions, purpose). This segmentation gave us more homogeneous subsets, which makes it easier to isolate the effects of the parameters of interest. The figure below shows a sample result from the EPIC survey, which we tried to reproduce in the preliminary part of our study.

Choice of commuting mode in the EPIC survey [1]

Research Statement

This study examines two main aspects related to car usage within households: what factors influence a household’s decision to own a car, and what factors determine how car owners decide to use their car as their primary mode of transportation.

We sought to understand the key drivers behind car ownership and the variables that influence car usage among those who own one.

Data Preparation

Dataset

AllgreD

Variable Label Missing_Percentage
nbarret Number_of_Stops_During_Trip 99.842185
abonpeage Public_Transport_Pass_Holder 98.538493
motdeacc Accompanied_Persons_Purpose_at_Destination 90.723206
motoracc Accompanied_Persons_Purpose_at_Origin 90.699190
NAT_STAT Parking_Type_Used 57.407026
NUM_VEH Vehicle_Number_Used 54.333059
NB_OCCU Number_of_Occupants_in_Vehicle 54.333059
LIEU_STAT Parking_Location_Used 54.333059
durstat Parking_Duration 54.333059
autoroute Highway_Usage 54.333059
prisecharge Transport_Costs_Covered 31.182242
ntraj Number_of_Stops_in_Trip 27.624537
TPS_MAP_DEP Walking_Time_At_Origin 27.507891
ZONE_D_TRAJ Zone_At_Origin_of_Stop 27.507891
ZONE_A_TRAJ Zone_At_Destination_of_Stop 27.507891
TPS_MAP_ARV Walking_Time_At_Destination 27.507891
id_traj Stop_ID 27.507891
Couteff Estimated_Transport_Cost 22.797448
D12 Travelled_Distance_As_Crow_Flies 7.393303
D13 Travelled_Distance_Declared 7.393303
zoneres.x.1 Residential_Zone_Number 4.315905
motifor Trip_Purpose_at_Origin 4.315905
zoneorig Origin_Zone_of_Trip 4.315905
heuredep Departure_Time_Hour 4.315905
mindep Departure_Time_Minute 4.315905
motifdes Trip_Purpose_at_Destination 4.315905
zonedest Destination_Zone_Number 4.315905
heurearr Arrival_Time_Hour 4.315905
minarr Arrival_Time_Minute 4.315905
duree Trip_Duration_Declared 4.315905
nbmodemec Number_of_Mechanized_Modes_Used 4.315905
NO_TRAJ Trip_Element_Number 4.315905
mode_V2 Modified_Transport_Mode 4.315905
mode_depl_ag Aggregated_Transport_Mode 4.315905
DEST DEST 4.315905
ORIG ORIG 4.315905
mot_o_red Simplified_Purpose_at_Origin 4.315905
mot_d_red Simplified_Purpose_at_Destination 4.315905
tir Observation_Drawing_Number 0.000000
NO_MEN Household_Number 0.000000
NO_PERS Individual_Number_in_Household 0.000000
NO_DEPL Trip_Number_for_Individual 0.000000
id_men Household_ID 0.000000
id_pers Individual_ID 0.000000
id_depl Trip_ID 0.000000
UN UN 0.000000

AllgreI

Variable Label Missing_Percentage
situveil Activity_on_Previous_Day 90.854922
STAT_TRAV Parking_Difficulties_at_Workplace 88.147668
VAL_ABO Public_Transport_Subscription_Validity 78.044042
PBM_STAT General_Parking_Problems 68.290155
dispovp Has_Access_to_Private_Vehicle 56.437824
zonetrav Work_or_Study_Location_Zone 37.629534
travdom Works_or_Studies_at_Home 35.246114
btt Total_Daily_Travel_Time 16.295337
fqvelo Bicycle_Use_Frequency 6.878238
FQ2R1 Motorized_Two_Wheeler_Use_Frequency_Type_1 6.878238
FQ2R2 Motorized_Two_Wheeler_Use_Frequency_Type_2 6.878238
fqvpcond Car_Use_Frequency_as_Driver 6.878238
fqvppass Car_Use_Frequency_as_Passenger 6.878238
freqtcu Urban_Transport_Use_Frequency 6.878238
freqtram Tramway_Use_Frequency 6.878238
freqrurb Other_Urban_Transport_Use_Frequency 6.878238
freqtransisere Transisere_Transport_Use_Frequency 6.878238
TEL_PORT Has_Mobile_Phone 5.841969
mail Has_Email 5.841969
permis Has_Driving_License 5.841969
etabscol Last_Educational_Institution_Attended 5.841969
OCCU1 Main_Occupation 5.841969
OCCU2 Secondary_Occupation 5.841969
csp Socio_Professional_Category 5.841969
ABO_TC Has_Public_Transport_Subscription 5.841969
freqter Regional_Train_Use_Frequency 5.841969
statut2 Aggregated_Socio_Economic_Status 5.841969
tir Observation_Drawing_Number 0.000000
NO_MEN Household_Number 0.000000
NO_PERS Individual_Number_in_Household 0.000000
zoneres.y Residential_Zone_Number 0.000000
sexe Gender 0.000000
lien Relationship_to_Household_Reference 0.000000
age Age 0.000000
id_men Household_ID 0.000000
id_pers Individual_ID 0.000000
nbd Number_of_Trips_Made 0.000000
UN Unknown 0.000000
cspgroup Grouped_Socio_Professional_Category 0.000000

AllgreM

Variable Label Missing_Percentage
tir Household Code 0.000000
NO_MEN Household Size 0.000000
zoneres.x Residence Area 0.000000
jourdepl Day of Travel 0.000000
TYPE_HAB Housing Type 0.000000
TYPE_OCU Occupancy Type 0.000000
Gare2 Dept of Reference SNCF Station 0.000000
Gare5 Postal Code of Reference SNCF Station 0.000000
telefon Has Telephone 0.000000
annuaire Listed in Directory 9.950556
internet Has Internet 0.000000
VP_DISPO Number of Cars Available 0.000000
GENRE1 Type of Car 1 13.844252
ENERGIE1 Fuel Type of Car 1 13.844252
AN_VP1 Year of Car 1 13.844252
PUIS_VP1 Engine Power of Car 1 13.844252
POSSES1 Ownership Status of Car 1 13.844252
LIEU_STAT1 Parking Location of Car 1 13.844252
TYPE_STAT1 Parking Type of Car 1 13.844252
GENRE2 Type of Car 2 53.522868
ENERGIE2 Fuel Type of Car 2 53.522868
AN_VP2 Year of Car 2 53.522868
PUIS_VP2 Engine Power of Car 2 53.522868
POSSES2 Ownership Status of Car 2 53.522868
LIEU_STAT2 Parking Location of Car 2 53.522868
TYPE_STAT2 Parking Type of Car 2 53.522868
GENRE3 Type of Car 3 92.119901
ENERGIE3 Fuel Type of Car 3 92.119901
AN_VP3 Year of Car 3 92.119901
PUIS_VP3 Engine Power of Car 3 92.119901
POSSES3 Ownership Status of Car 3 92.119901
LIEU_STAT3 Parking Location of Car 3 92.119901
TYPE_STAT3 Parking Type of Car 3 92.119901
GENRE4 Type of Car 4 98.609394
ENERGIE4 Fuel Type of Car 4 98.609394
AN_VP4 Year of Car 4 98.609394
PUIS_VP4 Engine Power of Car 4 98.609394
POSSES4 Ownership Status of Car 4 98.609394
LIEU_STAT4 Parking Location of Car 4 98.609394
TYPE_STAT4 Parking Type of Car 4 98.609394
NB_velo Number of Bikes 0.000000
NB_2Rm Number of Motorcycles 0.000000
COEF_MNG Management Coefficient 0.000000
id_men Household ID 0.000000
id_pers Individual ID 0.000000
id_depl Trip ID 0.000000
id_traj Stop ID 30.438813
nb_pers Number of People 0.000000
nbt2 Total Trips 13.195303
btt2 Total Daily Travel Time 13.195303

Steps performed

Select our variables and merge the datasets

Convert the required variables to factors: ORIG, DEST, UN, Area_at_origin_of_stop, travel_mode, zoneorig, zonedest, covered_trip_cost, trip_number, Area_at_destination_of_stop, parking_type, stop_id, mode_V2, highway_used, residence_zone_number, residence_area, id_men, id_pers, id_depl, parking_location, trip_day, housing_type, occupancy_type, POSSES1, POSSES2, POSSES3, POSSES4, dept_sncf_station, postal_sncf_station, socio_category_group, OCCU1, OCCU2, permis, work_zone, car_availability, relationship_status, sexe, public_trans_subscription, parking_problems, socio_category, employment_status, nbmodemec

Create new variables

  • real_travel_mode = binary variable with levels “VP” (private car) and “Autre” (any other mode). The resulting class counts for this variable are 10953 and 16937.

  • Departure_time

  • Arrival_time

  • Filter out observations with a missing travel_mode (1258)

Imputing missing values

  • number_of_stops, Area_at_origin_of_stop, Area_at_destination_of_stop, walk_time_at_destination, walk_time_at_origin, stop_id: when a trip has no stop, these fields are missing, so we replace the missing values with a “no stop” value.

  • parking_duration and highway_used: if the person did not travel by car, no highway and no parking were used, so we fill the missing values accordingly.

  • car_availability: we used travel_mode, number_of_cars, and num_people to impute the missing values.
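The first two imputation rules above can be sketched as follows. This is a minimal Python illustration, not the code used in the study: the field names reuse the renamed variables listed above, while the fill values (0, "no_stop", "not_used") and the function name impute_trip are assumptions made for the example.

```python
# Assumed fill values for stop-related fields when a trip has no stop.
STOP_DEFAULTS = {
    "number_of_stops": 0,
    "walk_time_at_origin": 0,
    "walk_time_at_destination": 0,
    "Area_at_origin_of_stop": "no_stop",
    "Area_at_destination_of_stop": "no_stop",
    "stop_id": "no_stop",
}


def impute_trip(trip):
    """Return a copy of one trip record with the rule-based imputations applied."""
    filled = dict(trip)

    # Rule 1: a missing stop field is read as "this trip had no stop".
    for field, default in STOP_DEFAULTS.items():
        if filled.get(field) is None:
            filled[field] = default

    # Rule 2: parking and highway fields only make sense for car (VP) trips,
    # so missing values on non-car trips are filled with "not_used".
    if filled.get("travel_mode") != "VP":
        for field in ("parking_duration", "highway_used"):
            if filled.get(field) is None:
                filled[field] = "not_used"
    return filled


walking_trip = {"travel_mode": "MAP", "stop_id": None, "highway_used": None}
print(impute_trip(walking_trip))
```

Missing parking or highway values on car trips are deliberately left untouched here, since the rule stated above only covers trips made without a car.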

Deal with trip loops

id_pers total_trips purpose_sequence mode_sequence
903116001 2 ACHAT -> DOMICILE VP->VP
903116002 9 TRAVAIL -> TRAVAIL -> TRAVAIL -> DOMICILE -> ACHAT -> LOISIR -> LOISIR -> LOISIR -> DOMICILE VP->Autre->Autre->VP->VP->VP->VP->VP->VP
903116003 3 ACHAT -> LOISIR -> LOISIR MAP->MAP->MAP
903117001 7 ACHAT -> ACHAT -> DOMICILE -> TRAVAIL -> ACCOMPAGNEMENT -> LOISIR -> DOMICILE MAP->MAP->MAP->VP->VP->VP->VP
903122001 4 ACHAT -> DOMICILE -> LOISIR -> DOMICILE MAP->MAP->MAP->MAP
903122002 2 LOISIR -> DOMICILE VP->VP

Old data

id_pers departure_time travel_mode mode_switch mode_group trip_sequence total_trips_same_mode total_crownTravel_distance total_actualTravel_distance total_declared_trip_duration arrival_time end_arrival_time
101019001 09:45 MAP FALSE 0 1 4 2783 3.199 48 10:00 15:58
101019001 10:30 MAP FALSE 0 2 4 2783 3.199 48 10:45 15:58
101019001 10:50 MAP FALSE 0 3 4 2783 3.199 48 10:55 15:58
101019001 15:45 MAP FALSE 0 4 4 2783 3.199 48 15:58 15:58
101019001 16:05 TCU TRUE 1 1 2 7194 9.999 50 16:25 17:40
101019001 17:10 TCU FALSE 1 2 2 7194 9.999 50 17:40 17:40
101019001 18:00 MAP TRUE 2 1 4 8405 9.665 145 19:07 24:05
101019001 19:07 MAP FALSE 2 2 4 8405 9.665 145 20:15 24:05
101019001 21:50 MAP FALSE 2 3 4 8405 9.665 145 21:55 24:05
101019001 24:00 MAP FALSE 2 4 4 8405 9.665 145 24:05 24:05

New data

id_pers travel_mode mode_switch mode_group trip_sequence total_trips_same_mode total_crownTravel_distance total_actualTravel_distance total_declared_trip_duration departure_time arrival_time end_arrival_time
101019001 MAP FALSE 0 1 4 2783 3.199 48 09:45 10:00 15:58
101019001 TCU TRUE 1 1 2 7194 9.999 50 16:05 16:25 17:40
101019001 MAP TRUE 2 1 4 8405 9.665 145 18:00 19:07 24:05
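The step from the old to the new table can be read as a run-length grouping: consecutive trips made with the same mode are collapsed into one record. The sketch below is a Python illustration under that reading, not the authors' code; the function name collapse_same_mode_runs is hypothetical, and only the time and mode columns are reproduced (the distance and duration totals are omitted).

```python
from itertools import groupby


def collapse_same_mode_runs(trips):
    """Collapse consecutive same-mode trips of one person into one record each.

    `trips` must already be sorted by departure time.
    """
    collapsed = []
    runs = groupby(trips, key=lambda trip: trip["travel_mode"])
    for group_index, (mode, run) in enumerate(runs):
        run = list(run)
        collapsed.append({
            "travel_mode": mode,
            "mode_switch": group_index > 0,   # True when the mode changed
            "mode_group": group_index,
            "total_trips_same_mode": len(run),
            "departure_time": run[0]["departure_time"],
            "arrival_time": run[0]["arrival_time"],
            "end_arrival_time": run[-1]["arrival_time"],
        })
    return collapsed


# Trips of individual 101019001 from the "Old data" table.
raw = [
    ("09:45", "MAP", "10:00"), ("10:30", "MAP", "10:45"),
    ("10:50", "MAP", "10:55"), ("15:45", "MAP", "15:58"),
    ("16:05", "TCU", "16:25"), ("17:10", "TCU", "17:40"),
    ("18:00", "MAP", "19:07"), ("19:07", "MAP", "20:15"),
    ("21:50", "MAP", "21:55"), ("24:00", "MAP", "24:05"),
]
trips = [
    {"departure_time": dep, "travel_mode": mode, "arrival_time": arr}
    for dep, mode, arr in raw
]
for record in collapse_same_mode_runs(trips):
    print(record)
```

Applied to the ten trips of the old table, this yields the three records of the new table (MAP, TCU, MAP) with the same departure, arrival, and end arrival times.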

Total Trips with Same Mode

The analysis reveals a strong positive relationship between the number of consecutive trips made with the same mode and the percentage of car users. Specifically, when individuals anticipate making multiple trips consecutively, the likelihood of choosing a car increases significantly. The percentage of car users rises from 38% for a single trip to its peak value as the total trips made with the same mode approach 16. This trend underscores the preference for cars in scenarios involving repeated, consistent travel.

We then filtered out individuals with more than 13 consecutive car trips, because these are probably taxi drivers, whom we do not want to include in our analysis.

Insights

Travel Distance and Duration Analysis

The analysis of travel distance shows that MAP (walking) covers the shortest distances, which is expected given its localized nature. VP (private car) distances fall between those of TCU (urban public transport) and TCIU (intercity public transport), in line with the car’s flexibility for both urban and intercity travel. Interestingly, trips classified under “Autre” also cover substantial distances, indicating diverse long-distance usage. For travel duration, despite the longer distances, car trips are shorter than TCU, TCIU, and “Autre” trips. This suggests a preference for cars when time is a critical factor.

Location Effect on travel mode

Distribution of the travel mode by residence area


 inside_the_city outside_the_city 
            5962             4079 

The analysis indicates that car usage is evenly distributed among individuals living inside and outside the city. In contrast, other travel modes from the dataset are predominantly used within the city. This suggests that while cars offer consistent utility regardless of location, alternative modes cater more to urban travel needs.

Distribution of travel_mode by trip type


    city_to_city   inside_outside outside_the_city  Within_the_city 
             319             1591             2404             5727 

The analysis reveals a significant disparity in trip types, with the majority of observations corresponding to trips made within the same city, such as Grenoble, Voiron, St Marcellin, or La Touvet. Alternative modes of transport are predominantly used for within-city travel, whereas cars dominate for intercity and city-to-outside trips. This highlights the reliance on cars for longer and less localized travel.

Effect of distance and duration

Analyzing distance and duration alongside trip types confirms that shorter distances correlate with trips within or just outside the city, while longer distances are associated with intercity or city-to-outside trips—an expected pattern. Within-city and outside-city trips show a preference for cars when distances increase, as they also minimize travel time. For inside-outside or city-to-city trips, the travel distance remains similar regardless of the mode, which is logical given the fixed nature of such routes. However, car users consistently achieve these trips in less time, reinforcing the car’s efficiency for longer travel.

Factors Influencing Household Car Ownership

In this analysis, we examined several key factors influencing car ownership, using variables from a research paper on the same topic. These factors include the number of children and adults in the household, urban area, income, and whether the individual lives alone.

To do so, we created several new variables: Age_group, categorized as teenagers, young adults, adults, and seniors based on the age variable; Urban_area, defined as major city (for Grenoble or Voiron), suburb (for St Marcellin or La Touvet), and rural (for locations outside any city); and Income, derived from the combination of the csp variable (representing profession) and occu1 and occu2 (indicating employment status such as full-time, part-time, or student). These new variables were essential in understanding the factors influencing car ownership.
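As an illustration of how such derived variables can be built, here is a minimal Python sketch. It is not the authors' code: the function names are hypothetical, the city lists are copied from the definition above, and the living_alone rule (household of size 1) and the senior cutoff (older than 55, the threshold mentioned later in this report) are assumptions about the exact implementation.

```python
MAJOR_CITIES = {"Grenoble", "Voiron"}
SUBURBS = {"St Marcellin", "La Touvet"}


def urban_area(city):
    """Classify a residence city; None stands for a location outside any city."""
    if city in MAJOR_CITIES:
        return "Major city"
    if city in SUBURBS:
        return "Suburb"
    return "Rural"


def is_senior(age):
    """Senior flag, assuming the >55 years cutoff."""
    return age > 55


def living_alone(household_size):
    """Assumed rule: 1 when the household has exactly one member, else 0."""
    return 1 if household_size == 1 else 0


print(urban_area("Voiron"), urban_area("St Marcellin"), urban_area(None))
print(is_senior(60), living_alone(1))
```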

% Difference table

category | group | percent_change_from_benchmark | chisq.test_p_value
Number of Children (Benchmark: No Children in the household) | One Child | +9.51 % | 8.43e-15
Number of Children (Benchmark: No Children in the household) | Two Children | +11.8 % | 8.43e-15
Number of Children (Benchmark: No Children in the household) | Three or more Children | +13.9 % | 8.43e-15
Number of Adults (Benchmark: 1 adult in the household) | Three or more adults | +20.92 % | 6.53e-69
Number of Adults (Benchmark: 1 adult in the household) | Two adults | +21.42 % | 6.53e-69
Urban Area (Benchmark: Major City) | Suburb | +9.65 % | 1.05e-29
Urban Area (Benchmark: Major City) | Rural | +14.31 % | 1.05e-29
Income Group (Benchmark: No income) | High income | +59.64 % | 3.94e-96
Income Group (Benchmark: No income) | Low income | +54.19 % | 3.94e-96
Income Group (Benchmark: No income) | Medium income | +56.77 % | 3.94e-96
Living alone (Benchmark: No) | 1 | -22.29 % | 6.02e-72

The data shows a clear trend: as the number of children increases, the likelihood of car ownership rises, with a 9.5% increase for one child, 11.8% for two, and 13.9% for three or more. Similarly, the percentage of households owning a car increases by 21% as the number of adults grows. Urban area also plays a role, with car ownership higher in suburban (9.65%) and rural (14.3%) areas compared to major cities. Income follows the same pattern: higher household income correlates with increased car ownership. Additionally, individuals living alone are less likely to own a car. Importantly, the chi-square test p-value confirms that there is a statistically significant association between these variables and whether a household owns a car, reinforcing the observed patterns.
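The report does not spell out how the change from the benchmark is computed. One plausible reading, assumed here, is the difference in percentage points between the ownership rate of a group and that of the benchmark group. The Python sketch below shows that calculation on hypothetical counts (they are not taken from the survey), with hypothetical function names.

```python
def ownership_rate(owners, households):
    """Share of households owning a car, in percent."""
    return 100.0 * owners / households


def change_from_benchmark(group, benchmark):
    """Difference in ownership rate (percentage points) between a group and the benchmark.

    Each argument is a pair (number of car-owning households, number of households).
    """
    return ownership_rate(*group) - ownership_rate(*benchmark)


# Hypothetical counts for illustration only.
no_children = (800, 1000)   # benchmark: 80% ownership
one_child = (450, 500)      # 90% ownership
print(change_from_benchmark(one_child, no_children))
```

Under this reading, a value such as +9.51 % means the ownership rate of the group is 9.51 points above the benchmark rate, not 9.51% higher in relative terms.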

Predictive Analysis of Car Ownership

  • First Model
Train Data:

   0    1 
 290 2132 

We can see that the data is imbalanced. To address this, we apply SMOTE (Synthetic Minority Over-sampling Technique):

SMOTE Data:

   0    1 
 870 1160 
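The class counts after resampling can be checked with simple arithmetic. The Python sketch below assumes the percentage convention used by the SMOTE function of the R package DMwR, with perc.over = 200 and perc.under = 200. The report does not state which implementation or settings were used, so this is only a consistency check, and the function name smote_class_counts is hypothetical.

```python
def smote_class_counts(n_minority, perc_over=200, perc_under=200):
    """Class sizes after SMOTE under the assumed percentage convention.

    perc_over:  synthetic minority cases created, as a percentage of the minority size.
    perc_under: majority cases kept, as a percentage of the synthetic cases created.
    """
    synthetic = n_minority * perc_over // 100
    minority_after = n_minority + synthetic
    majority_after = synthetic * perc_under // 100
    return minority_after, majority_after


# 290 households without a car in the training data.
print(smote_class_counts(290))
```

With 290 minority cases, these settings give 870 minority and 1160 majority cases, which matches the table above.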

Call:
 randomForest(formula = have_car ~ living_alone + number_of_adults_group +      +have_alternatives + number_of_children + income_value +      urban_area, data = smote_train, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 19.21%
Confusion matrix:
    0   1 class.error
0 724 146   0.1678161
1 244 916   0.2103448

The random forest model’s out-of-bag (OOB) error rate is 19.21%, meaning that approximately 80.79% of the out-of-bag predictions were correct. The variable importance analysis shows that the four most important factors are number_of_adults, followed by household income, whether the individual is living_alone, and urban area. To optimize the model, we can try removing the least important variables.
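The OOB error rate and the per-class errors follow directly from the confusion matrix printed above, where rows are the true classes and columns the predicted classes. A minimal Python check (the function name oob_summary is hypothetical):

```python
def oob_summary(matrix):
    """Overall error rate and per-class error from a confusion matrix.

    matrix[i][j] is the number of cases of true class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    error_rate = 1 - correct / total
    class_errors = [1 - row[i] / sum(row) for i, row in enumerate(matrix)]
    return error_rate, class_errors


error_rate, class_errors = oob_summary([[724, 146], [244, 916]])
print(f"OOB error rate: {error_rate:.4f}")
print("class errors:", [f"{e:.4f}" for e in class_errors])
```

This gives an error rate of about 0.1921 and class errors of about 0.1678 and 0.2103, in line with the printed output.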

  • Confusion Matrix Metrics
Accuracy of the model:  0.7433775
Confusion Matrix: Test Set:

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  51 134
         1  21 398
                                          
               Accuracy : 0.7434          
                 95% CI : (0.7066, 0.7778)
    No Information Rate : 0.8808          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.2719          
                                          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.70833         
            Specificity : 0.74812         
         Pos Pred Value : 0.27568         
         Neg Pred Value : 0.94988         
             Prevalence : 0.11921         
         Detection Rate : 0.08444         
   Detection Prevalence : 0.30629         
      Balanced Accuracy : 0.72823         
                                          
       'Positive' Class : 0               
                                          

The model reaches an accuracy of 74.34% on the test set, which is below the No Information Rate (NIR) of 88.08%: because of the strong class imbalance in the test data, always predicting the majority class would be more accurate. The sensitivity for Class 0 is 70.83%, which is good, but the positive predictive value (PPV) is low at 27.57%, meaning that most predictions of Class 0 are incorrect. The balanced accuracy of 72.82% indicates that improvements are needed, especially in predicting Class 0, potentially through further class balancing or model adjustments.
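The figures in the test-set output can be recomputed from the four cells of the confusion matrix. The Python sketch below uses the standard definitions with Class 0 as the positive class, as in the output above; the function name binary_metrics is hypothetical.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard binary classification metrics from confusion-matrix cells."""
    total = tp + fn + fp + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        # Accuracy obtained by always predicting the most frequent class.
        "no_information_rate": max(tp + fn, fp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Positive class is 0: tp = predicted 0 and actually 0, fp = predicted 0 but actually 1.
metrics = binary_metrics(tp=51, fn=21, fp=134, tn=398)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The balanced accuracy is the mean of sensitivity and specificity, which is why it differs from the plain accuracy when the classes are imbalanced.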

Factors Influencing Car Usage Among Car Owners

Travel mode share by trip purpose (pie charts): full sample, car owners, and non-car owners

The pie charts reveal that car usage is the most prevalent mode of transport across all three trip purposes, with the highest percentage for commuting (home to work), followed by leisure (shopping, leisure, and accompaniment), and lastly for education. Walking is also a notable mode for both education and leisure trips.

Regular access to a car emerges as the key determinant of transport mode choice, as shown by the marked differences between those with and without access to a car. However, car access alone does not fully explain mode choice. We therefore explore additional factors by comparing respondents from households that all have car access but differ in other characteristics.

Urban Area

% Difference in Rural Areas

In rural areas, car usage has significantly risen for all trip types—commuting, leisure, and educational—compared to the major city, with increases of 12.1%, 4.3%, and 11.7% respectively. This shift to car usage in rural areas has mostly replaced TCU and MAP. For educational trips, the decline in TCU and MAP is not only offset by cars but also by TCIU and other modes of transportation. Overall, these findings show that moving away from urban areas leads to a greater reliance on cars for all trip types, with the most significant changes occurring in rural areas, especially for leisure and commuting activities.

% Difference in Suburb Areas

In the suburbs, car usage is higher than in the major city, especially for commuting and leisure trips, with increases of 6.9% and 9.1% respectively. For educational trips, however, the share of car usage is lower. The use of TCIU is also higher for commuting and educational trips, more noticeably for educational trips. The growth in car usage largely replaces walking (MAP), which is lower for all trip purposes. Notably, the increase in car usage for leisure activities primarily replaces TCU.

Chi-Squared Analysis

Chi-squared test of urban areas and travel mode for commuting:

    Pearson's Chi-squared test

data:  table_commuting
X-squared = 258.4, df = 8, p-value < 2.2e-16

Chi-squared test of urban areas and travel mode for education:

    Pearson's Chi-squared test

data:  table_education
X-squared = 248.9, df = 8, p-value < 2.2e-16

Chi-squared test of urban areas and travel mode for leisure:

    Pearson's Chi-squared test

data:  table_leisure
X-squared = 275.87, df = 8, p-value < 2.2e-16

The Chi-squared tests of the relationship between urban area and travel mode show a highly significant association for every trip purpose (commuting, education, and leisure), with p-values far below 0.05. Urban area and travel mode choice are therefore not independent. These findings align with the previous analysis, where we observed that car usage increases in suburban and rural areas compared to major cities.
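For reference, Pearson's statistic compares each observed count with the count expected under independence, and the degrees of freedom are (rows - 1) × (columns - 1). With 3 urban areas and 5 travel modes this gives df = 8, as in the outputs above. The Python sketch below implements the statistic without continuity correction; the 3 × 5 table is hypothetical and is not the survey data.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Hypothetical counts: rows = urban areas, columns = travel modes.
hypothetical = [
    [120, 60, 40, 20, 10],
    [90, 30, 20, 25, 5],
    [150, 20, 10, 30, 5],
]
statistic, df = chi_squared(hypothetical)
print(f"X-squared = {statistic:.2f}, df = {df}")
```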

Income influence

           
            High Income Low Income Medium Income No Income
  Commuting         714        427          1344       433
  Education           1          4             0      1236
  Leisure           477       2350          1065       714

The analysis reveals a significant increase in car usage as income rises, particularly for commuting trips, where high-income individuals show more than a 20% increase in car usage compared to No-income individuals. This increase mainly replaced trips made by MAP, but also to a lesser extent TCU and TCIU. For leisure activities, there is a notable rise in car usage with increasing income, along with a surprising increase in MAP usage as well. Additionally, for leisure trips, higher income leads to a decrease in the usage of other transport modes (TCU, TCIU, Autres).

We did not run this analysis for education trips, since 99% of the individuals who are still studying have no income.

Chi-Squared Analysis

Chi-squared test of income and travel mode for commuting:

    Pearson's Chi-squared test

data:  table_commuting
X-squared = 225.56, df = 12, p-value < 2.2e-16

Chi-squared test of income and travel mode for leisure:

    Pearson's Chi-squared test

data:  table_leisure
X-squared = 11.743, df = 8, p-value = 0.1631

The Chi-squared test results confirm that income is significantly associated with travel mode for commuting trips, with a p-value far below 0.05, supporting the previous analysis. For leisure trips, however, the test yields a p-value of 0.1631, indicating no significant relationship between income and travel mode. This contradicts the earlier descriptive analysis, which suggested that income might influence car usage for leisure. Thus, income affects mode choice for commuting but does not significantly affect it for leisure trips; education trips were not tested.
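The reported p-values can be checked by hand. For an even number of degrees of freedom, the upper tail of the chi-squared distribution has a closed form: P(X ≥ x) = exp(-x/2) · Σ (x/2)^k / k!, with k running from 0 to df/2 - 1. The Python sketch below applies it to the leisure test (X-squared = 11.743, df = 8); the function name chi2_sf_even_df is hypothetical.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X >= x) of a chi-squared variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2
    term = 1.0    # (x/2)^k / k! for k = 0
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total


p_leisure = chi2_sf_even_df(11.743, 8)
print(f"p-value for leisure: {p_leisure:.3f}")
```

The result is approximately 0.163, which agrees with the reported 0.1631 up to the rounding of the statistic.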

Distance Influence

The analysis shows a substantial increase in car usage, of up to 50% for both leisure and commuting trips, as distances increase from short to long. There is also a slight uptick in the use of other modes, such as TCU, TCIU, and Autres. Together these replace MAP, whose share falls by more than 80%. For education trips, car usage also increases with distance, though not as strongly as for commuting and leisure. Interestingly, for education trips, car usage decreases between medium and long distances, where it is primarily replaced by TCIU. For all three trip purposes, TCU usage declines as distance grows, which is logical, as buses and trams become less practical for longer journeys.

Chi-Squared Analysis

Chi-squared test of distance and travel mode for commuting:

    Pearson's Chi-squared test

data:  table_commuting
X-squared = 1553.3, df = 8, p-value < 2.2e-16

Chi-squared test of distance and travel mode for education:

    Pearson's Chi-squared test

data:  table_education
X-squared = 2287.4, df = 8, p-value < 2.2e-16

Chi-squared test of distance and travel mode for leisure:

    Pearson's Chi-squared test

data:  table_leisure
X-squared = 841.8, df = 8, p-value < 2.2e-16

The Chi-squared tests of the relationship between distance category (short, medium, or long) and travel mode all show a highly significant association for commuting, education, and leisure trips, with p-values far below 0.05. This confirms that the distance category has a statistically significant influence on travel mode choice across trip purposes, consistent with the descriptive finding that longer distances are associated with a higher reliance on cars.

Extra variables

category travel_mode group percent_diff chi_squared_p_value
sexe(Benchmark: Male) Autre 2 -3.4781354 4.42e-14
sexe(Benchmark: Male) MAP 2 3.5519556 4.42e-14
sexe(Benchmark: Male) TCIU 2 0.4685953 4.42e-14
sexe(Benchmark: Male) TCU 2 2.4293033 4.42e-14
sexe(Benchmark: Male) VP 2 -2.9717188 4.42e-14
Senior(Benchmark :NO) Autre 1 -6.1290951 8.80e-27
Senior(Benchmark :NO) MAP 1 4.6538475 8.80e-27
Senior(Benchmark :NO) TCIU 1 -2.2187492 8.80e-27
Senior(Benchmark :NO) TCU 1 -0.2084964 8.80e-27
Senior(Benchmark :NO) VP 1 3.9024932 8.80e-27
Has Permit to drive(Benchmark :Yes) Autre 0 6.8671246 1.12e-109
Has Permit to drive(Benchmark :Yes) MAP 0 6.0395404 1.12e-109
Has Permit to drive(Benchmark :Yes) TCIU 0 4.5943309 1.12e-109
Has Permit to drive(Benchmark :Yes) TCU 0 4.7697513 1.12e-109
Has Permit to drive(Benchmark :Yes) VP 0 -22.2707471 1.12e-109
Have alternative(Benchmark :NO) Autre 1 6.5235761 3.73e-20
Have alternative(Benchmark :NO) MAP 1 -4.3218818 3.73e-20
Have alternative(Benchmark :NO) TCIU 1 1.5134869 3.73e-20
Have alternative(Benchmark :NO) TCU 1 -2.7685434 3.73e-20
Have alternative(Benchmark :NO) VP 1 -0.9466378 3.73e-20
Student(Benchmark :NO) Autre 1 7.8308608 3.41e-105
Student(Benchmark :NO) MAP 1 3.9088550 3.41e-105
Student(Benchmark :NO) TCIU 1 4.4446268 3.41e-105
Student(Benchmark :NO) TCU 1 4.8092633 3.41e-105
Student(Benchmark :NO) VP 1 -20.9936059 3.41e-105
Living Alone(Benchmark :NO) Autre 1 -2.3387713 7.36e-09
Living Alone(Benchmark :NO) MAP 1 3.6393719 7.36e-09
Living Alone(Benchmark :NO) TCIU 1 -1.9057719 7.36e-09
Living Alone(Benchmark :NO) TCU 1 2.6062184 7.36e-09
Living Alone(Benchmark :NO) VP 1 -2.0010471 7.36e-09

The analysis of mode of transport usage reveals several trends based on key demographic and lifestyle factors.

Sex: Compared to males, females are slightly less likely to use cars (-3.0%) and “Autre” modes (-3.5%), and slightly more likely to walk (MAP, +3.6%) or use TCU (+2.4%). These differences are small but statistically significant. They could be linked to social and cultural factors.

Senior vs Non-Senior: Seniors (>55 years) exhibit higher car and MAP usage, while other modes, especially “Autres,” show a significant decrease of 6.1%. Older adults may prefer the convenience and comfort of cars or MAP due to mobility challenges or a lack of access to public transportation. They may also be less inclined to use public transport for long distances or during non-peak hours. The large 6.1% decrease in “Autres” usage for seniors is likely due to limited use of alternatives like bicycles or motorcycles, which are less practical or less accessible for this age group.

Not Having a Driving Permit: Individuals without a driving permit are 22% less likely to use a car. This is logical, as people without a driving permit cannot drive a car, thus reducing their chances of choosing this mode of transport. This group is more likely to rely on alternative modes of transport, including walking, public transport, or other forms of mobility.

Alternatives (Motorcycle or Bicycle): Individuals who have an alternative vehicle show a large increase in “Autres” usage (+6.5%), which is expected since motorcycles and bicycles are often used for short trips or leisure. MAP and TCU usage decrease, which makes sense, as people with alternatives are less likely to walk or use public transport. Car usage decreases only slightly, suggesting that while alternatives offer a viable option, they do not fully replace the car for longer or more essential trips.

Living Alone: Those living alone are more likely to walk or use TCU, and less likely to use cars or other transport modes. Without the need to coordinate travel with family or friends, they might prefer more flexible, accessible modes like walking or TCU.

Importantly, the Chi-squared tests for all these factors show very low p-values, indicating a statistically significant association between each variable (sex, senior status, driving permit, alternative transport modes, student status, and living alone) and the choice of travel mode. This suggests that the observed differences in travel behavior are not due to chance, reinforcing the trends identified in the descriptive analysis.

Machine Learning

Target to predict: real_travel_mode, based on has_car, trip_category, distance_category, urban_area, income, living_alone, age_group, has_alternatives, sexe, permis

tibble [9,617 × 11] (S3: tbl_df/tbl/data.frame)
 $ real_travel_mode : Factor w/ 2 levels "Autre","VP": 1 2 1 1 1 1 1 1 1 1 ...
 $ has_car          : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
 $ trip_category    : Factor w/ 3 levels "Commuting","Education",..: 3 1 3 3 1 1 3 3 3 3 ...
 $ distance_category: Factor w/ 3 levels "Long","Medium",..: 3 1 2 2 2 2 2 2 3 2 ...
 $ urban_area       : Factor w/ 3 levels "Major city","Rural",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ income           : Factor w/ 4 levels "High Income",..: 2 1 3 3 3 3 3 3 3 3 ...
 $ living_alone     : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 1 1 ...
 $ age_group        : Factor w/ 4 levels "Adults","Seniors",..: 4 1 4 4 4 4 4 4 4 4 ...
 $ has_alternatives : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
 $ sexe             : Factor w/ 2 levels "1","2": 1 1 2 1 1 1 2 2 2 2 ...
 $ permis           : Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 1 1 ...
[1] "train_data"

Autre    VP 
 3338  3395 
[1] "test_data"

Autre    VP 
 1430  1454 

Random Forest model


Call:
 randomForest(formula = real_travel_mode_binary ~ . - real_travel_mode,      data = trainData, importance = TRUE) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 3

        OOB estimate of  error rate: 22.87%
Confusion matrix:
     0    1 class.error
0 2396  942   0.2822049
1  598 2797   0.1761414

The random forest model shows an overall out-of-bag (OOB) error rate of 22.87%, which estimates its error on observations not used to grow each tree. The confusion matrix indicates that for Class 0, 2,396 instances were correctly classified while 942 were misclassified, a class error rate of 28.22%. For Class 1, the model performed better, correctly classifying 2,797 instances with 598 misclassifications, a lower class error rate of 17.61%. The model is therefore more effective at predicting Class 1 than Class 0. Since the two classes are nearly balanced in the training data, this gap may be better addressed through class weighting or hyperparameter tuning.

Confusion Matrix: Train Set:

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2396  598
         1  942 2797
                                          
               Accuracy : 0.7713          
                 95% CI : (0.7611, 0.7813)
    No Information Rate : 0.5042          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5421          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.7178          
            Specificity : 0.8239          
         Pos Pred Value : 0.8003          
         Neg Pred Value : 0.7481          
             Prevalence : 0.4958          
         Detection Rate : 0.3559          
   Detection Prevalence : 0.4447          
      Balanced Accuracy : 0.7708          
                                          
       'Positive' Class : 0               
                                          
Confusion Matrix: Test Set:

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1022  240
         1  408 1214
                                          
               Accuracy : 0.7753          
                 95% CI : (0.7596, 0.7904)
    No Information Rate : 0.5042          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5502          
                                          
 Mcnemar's Test P-Value : 5.367e-11       
                                          
            Sensitivity : 0.7147          
            Specificity : 0.8349          
         Pos Pred Value : 0.8098          
         Neg Pred Value : 0.7485          
             Prevalence : 0.4958          
         Detection Rate : 0.3544          
   Detection Prevalence : 0.4376          
      Balanced Accuracy : 0.7748          
                                          
       'Positive' Class : 0               
                                          

The confusion matrices on the train and test sets show consistent performance, with accuracy of 77.13% on the train set and 77.53% on the test set. Sensitivity and specificity are also similar: 71.78% and 82.39% on the train set, against 71.47% and 83.49% on the test set. The train-set matrix coincides with the OOB confusion matrix reported by the model, so both sets of figures describe predictions on observations the trees did not see. These minimal differences suggest that there is no significant overfitting and that the model generalizes well to unseen data. Since the positive class is 0, sensitivity measures how well Class 0 is recovered and specificity how well Class 1 is recovered: the model recovers Class 1 better, while there is room for improvement in detecting Class 0. Conversely, predictions of Class 0 are correct more often (positive predictive value of about 81%) than predictions of Class 1 (negative predictive value of about 75%). To reduce this asymmetry, methods such as class weighting, cutoff adjustment, or hyperparameter tuning may be beneficial.
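The reported statistics can be reproduced from the four cells of the test-set matrix. The sketch below uses base R only and assumes, as stated in the output, that rows are predictions, columns are reference labels, and the positive class is 0. All variable names are illustrative.

```r
# Test-set confusion matrix for the random forest
# (rows = prediction, columns = reference).
cm <- matrix(c(1022,  240,
                408, 1214),
             nrow = 2, byrow = TRUE,
             dimnames = list(prediction = c("0", "1"), reference = c("0", "1")))

# With class "0" as the positive class:
tp <- cm["0", "0"]; fp <- cm["0", "1"]
fn <- cm["1", "0"]; tn <- cm["1", "1"]
n  <- sum(cm)

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # recall of Class 0
specificity <- tn / (tn + fp)   # recall of Class 1
ppv         <- tp / (tp + fp)   # precision of Class 0 predictions
npv         <- tn / (tn + fn)   # precision of Class 1 predictions

# Cohen's kappa: agreement beyond what the marginals alone would give.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv, npv = npv,
              kappa = cohen_kappa), 4))
```

Swapping in the train-set cells (2396, 598, 942, 2797) gives the train-set figures in the same way.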

Logit model


Call:
glm(formula = real_travel_mode_binary ~ . - real_travel_mode, 
    family = "binomial", data = trainData)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -17.51731  201.16326  -0.087   0.9306    
has_car1                 18.69323  201.16320   0.093   0.9260    
trip_categoryEducation   -0.87061    0.12083  -7.205 5.80e-13 ***
trip_categoryLeisure     -0.01951    0.07614  -0.256   0.7977    
distance_categoryMedium  -0.40237    0.07704  -5.223 1.76e-07 ***
distance_categoryShort   -2.44829    0.09592 -25.523  < 2e-16 ***
urban_areaRural           0.32248    0.06458   4.993 5.94e-07 ***
urban_areaSuburb          0.02571    0.14468   0.178   0.8590    
incomeLow Income         -0.23742    0.11557  -2.054   0.0400 *  
incomeMedium Income       0.10062    0.10538   0.955   0.3397    
incomeNo Income           0.21868    0.19636   1.114   0.2654    
living_alone1            -0.29387    0.09366  -3.138   0.0017 ** 
age_groupSeniors          0.03795    0.09563   0.397   0.6915    
age_groupTeenagers       -0.60273    0.24333  -2.477   0.0133 *  
age_groupYoung Adults    -0.06836    0.09806  -0.697   0.4857    
has_alternatives1        -0.02505    0.08630  -0.290   0.7716    
sexe2                    -0.09590    0.06215  -1.543   0.1228    
permis1                   0.24419    0.17206   1.419   0.1559    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9333.4  on 6732  degrees of freedom
Residual deviance: 6596.0  on 6715  degrees of freedom
AIC: 6632

Number of Fisher Scoring iterations: 17
Start:  AIC=6631.96
real_travel_mode_binary ~ (real_travel_mode + has_car + trip_category + 
    distance_category + urban_area + income + living_alone + 
    age_group + has_alternatives + sexe + permis) - real_travel_mode

                    Df Deviance    AIC
- has_alternatives   1   6596.0 6630.0
- permis             1   6598.0 6632.0
<none>                   6596.0 6632.0
- sexe               1   6598.3 6632.3
- age_group          3   6602.6 6632.6
- living_alone       1   6605.7 6639.7
- income             3   6610.4 6640.4
- urban_area         2   6621.6 6653.6
- trip_category      2   6655.8 6687.8
- has_car            1   7423.9 7457.9
- distance_category  2   7571.5 7603.5

Step:  AIC=6630.04
real_travel_mode_binary ~ has_car + trip_category + distance_category + 
    urban_area + income + living_alone + age_group + sexe + permis

                    Df Deviance    AIC
- permis             1   6598.0 6630.0
<none>                   6596.0 6630.0
- sexe               1   6598.4 6630.4
- age_group          3   6602.9 6630.9
- living_alone       1   6605.9 6637.9
- income             3   6610.4 6638.4
- urban_area         2   6621.6 6651.6
- trip_category      2   6655.9 6685.9
- has_car            1   7425.4 7457.4
- distance_category  2   7571.5 7601.5

Step:  AIC=6630
real_travel_mode_binary ~ has_car + trip_category + distance_category + 
    urban_area + income + living_alone + age_group + sexe

                    Df Deviance    AIC
<none>                   6598.0 6630.0
- sexe               1   6600.8 6630.8
- living_alone       1   6607.0 6637.0
- income             3   6613.2 6639.2
- age_group          3   6619.7 6645.7
- urban_area         2   6624.0 6652.0
- trip_category      2   6657.5 6685.5
- distance_category  2   7573.8 7601.8
- has_car            1   7615.3 7645.3

The stepwise selection removes has_alternatives and then permis, lowering the AIC from 6631.96 to 6630.04 and then to 6630.00. The most important predictors are has_car and distance_category: dropping either one raises the AIC by more than 800 points. In the final model, trip_category (AIC 6685.5 when dropped), urban_area (6652.0), age_group (6645.7), income (6639.2), and living_alone (6637.0) contribute less but still improve the fit, while sexe is borderline (6630.8). Removing has_alternatives and permis leaves the deviance almost unchanged (6596.0 to 6598.0), so the simpler model fits essentially as well as the full one.
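The AIC column of the stepwise tables can be recovered from the deviance column. The sketch below assumes an ungrouped 0/1 response, in which case the AIC of a logit model equals its residual deviance plus twice the number of estimated parameters. The deviances are copied from the first step table; the data frame `drops` is an illustrative helper, not part of the original code.

```r
# Full model: residual deviance and number of parameters
# (17 slopes + 1 intercept, from the coefficient table).
full_deviance <- 6596.0
full_params   <- 18

# Selected single-term deletions from the first step() table.
drops <- data.frame(
  term     = c("has_alternatives", "permis", "income",
               "trip_category", "has_car", "distance_category"),
  df       = c(1, 1, 3, 2, 1, 2),
  deviance = c(6596.0, 6598.0, 6610.4, 6655.8, 7423.9, 7571.5)
)

# AIC = deviance + 2 * (parameters left after dropping the term).
drops$aic       <- drops$deviance + 2 * (full_params - drops$df)
drops$delta_aic <- drops$aic - (full_deviance + 2 * full_params)

# Terms sorted from least to most costly to remove.
print(drops[order(drops$delta_aic), ])
```

A negative `delta_aic` marks a term whose removal improves the AIC, which is why has_alternatives is dropped first.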

summary(model1)

Call:
glm(formula = real_travel_mode_binary ~ . - real_travel_mode, 
    family = "binomial", data = trainData)

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
(Intercept)             -17.51731  201.16326  -0.087   0.9306    
has_car1                 18.69323  201.16320   0.093   0.9260    
trip_categoryEducation   -0.87061    0.12083  -7.205 5.80e-13 ***
trip_categoryLeisure     -0.01951    0.07614  -0.256   0.7977    
distance_categoryMedium  -0.40237    0.07704  -5.223 1.76e-07 ***
distance_categoryShort   -2.44829    0.09592 -25.523  < 2e-16 ***
urban_areaRural           0.32248    0.06458   4.993 5.94e-07 ***
urban_areaSuburb          0.02571    0.14468   0.178   0.8590    
incomeLow Income         -0.23742    0.11557  -2.054   0.0400 *  
incomeMedium Income       0.10062    0.10538   0.955   0.3397    
incomeNo Income           0.21868    0.19636   1.114   0.2654    
living_alone1            -0.29387    0.09366  -3.138   0.0017 ** 
age_groupSeniors          0.03795    0.09563   0.397   0.6915    
age_groupTeenagers       -0.60273    0.24333  -2.477   0.0133 *  
age_groupYoung Adults    -0.06836    0.09806  -0.697   0.4857    
has_alternatives1        -0.02505    0.08630  -0.290   0.7716    
sexe2                    -0.09590    0.06215  -1.543   0.1228    
permis1                   0.24419    0.17206   1.419   0.1559    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 9333.4  on 6732  degrees of freedom
Residual deviance: 6596.0  on 6715  degrees of freedom
AIC: 6632

Number of Fisher Scoring iterations: 17

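The drop from the null deviance (9333.4 on 6732 degrees of freedom) to the residual deviance (6596.0 on 6715 degrees of freedom) is 2737.4 on 17 degrees of freedom, so the model as a whole is highly significant compared with the intercept-only model. Note that the summary printed above is that of the full model: its call and coefficient table still include has_alternatives and permis, which the stepwise procedure removed. The very large intercept (-17.5) and has_car1 coefficient (18.7), both with standard errors of about 201, together with the 17 Fisher scoring iterations, suggest quasi-complete separation: trips made by households without a car are almost never car trips. Together with several insignificant predictors, this indicates that there is room to refine the model.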
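The overall significance of the model can be checked with a likelihood-ratio test built from the two deviances in the summary. This is a sketch using base R; the McFadden pseudo-R² line is an addition for illustration and assumes an ungrouped 0/1 response, so that the deviance equals minus twice the log-likelihood.

```r
# Deviances and degrees of freedom copied from the glm summary.
null_deviance  <- 9333.4; null_df  <- 6732
resid_deviance <- 6596.0; resid_df <- 6715

# Likelihood-ratio statistic: deviance explained by the 17 slopes.
lr_stat <- null_deviance - resid_deviance
lr_df   <- null_df - resid_df

# Upper-tail chi-square probability; underflows to 0 for a statistic this large.
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden pseudo-R^2 (illustrative addition, not in the original report).
mcfadden_r2 <- 1 - resid_deviance / null_deviance

print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value,
        mcfadden_r2 = round(mcfadden_r2, 3)))
```

A small p-value here only says the predictors jointly improve on the intercept-only model; it does not by itself establish that the fit is good.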

For the variables:

Strong Negative Influences: Short trips and education-related trips significantly reduce the likelihood of car usage relative to long trips and commuting trips, respectively. Short trips, in particular, have a very strong negative effect (coefficient -2.45). Teenagers also show a reduced likelihood of car usage compared with adults (-0.60, significant at the 5% level).

Strong Positive Influence: The presence of a car (has_car1) has a very large positive coefficient (+18.69), which suggests that having a car greatly increases the likelihood of its usage. Its magnitude should not be read literally, given its standard error of about 201. Living in a rural area also has a positive effect (+0.32), indicating a higher likelihood of car usage than in major cities.

Moderate Effects: Medium-distance trips (-0.40) and low-income individuals (-0.24) moderately reduce the likelihood of car usage. The coefficient for medium-income individuals is slightly positive (+0.10) but not statistically significant (p = 0.34).

Minimal Impact: Leisure trips (trip_categoryLeisure) and living in suburban areas (urban_areaSuburb) have coefficients close to zero (-0.02 and +0.03) that are not statistically significant, so they have little to no impact on car usage compared with the more influential factors.
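To make the size of these effects easier to read, the logit coefficients can be converted into odds ratios by exponentiation. The sketch below copies the significant estimates from the coefficient table. Each odds ratio is relative to the reference level of its factor (long trips, commuting, adults, major city), as inferred from the factor levels listed earlier.

```r
# Estimates copied from the glm coefficient table.
coefs <- c(
  distance_categoryShort  = -2.44829,
  trip_categoryEducation  = -0.87061,
  age_groupTeenagers      = -0.60273,
  distance_categoryMedium = -0.40237,
  urban_areaRural         =  0.32248
)

# exp(beta) is the multiplicative change in the odds of using the car.
odds_ratio <- exp(coefs)

# Same information expressed as a percentage change in the odds.
pct_change <- 100 * (odds_ratio - 1)

print(round(cbind(odds_ratio, pct_change), 2))
```

For instance, the short-trip odds ratio comes out near 0.09, meaning the odds of choosing the car are roughly 91% lower than for long trips, all else being equal.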

The variable has_car1 is not statistically significant according to its Wald test (p-value of 0.926, above 0.05). This test is unreliable here because the standard error (about 201) is inflated. The stepwise procedure, which compares deviances instead, shows that removing has_car raises the deviance from 6596.0 to 7423.9 (AIC from 6632.0 to 7457.9), so the variable is retained as one of the strongest predictors in the model.
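The disagreement between the two tests for has_car can be illustrated from the printed numbers alone. The following sketch recomputes the Wald z-test from the coefficient table and a deviance-based test from the first stepwise table; variable names are illustrative.

```r
# Wald test: estimate divided by its standard error.
estimate  <- 18.69323
std_error <- 201.16320
wald_z <- estimate / std_error
wald_p <- 2 * pnorm(abs(wald_z), lower.tail = FALSE)

# Deviance-based test: change in deviance when has_car is dropped (1 df).
deviance_full    <- 6596.0
deviance_without <- 7423.9
lr_stat <- deviance_without - deviance_full
lr_p    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(c(wald_z = wald_z, wald_p = wald_p, lr_stat = lr_stat, lr_p = lr_p))
```

The Wald p-value reproduces the 0.926 shown in the summary, while the deviance comparison gives a vanishingly small p-value. This is the pattern one would expect when a very large standard error makes the Wald statistic uninformative.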

Confusion Matrix: Train Set:

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 2236  483
         1 1102 2912
                                          
               Accuracy : 0.7646          
                 95% CI : (0.7543, 0.7747)
    No Information Rate : 0.5042          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5284          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6699          
            Specificity : 0.8577          
         Pos Pred Value : 0.8224          
         Neg Pred Value : 0.7255          
             Prevalence : 0.4958          
         Detection Rate : 0.3321          
   Detection Prevalence : 0.4038          
      Balanced Accuracy : 0.7638          
                                          
       'Positive' Class : 0               
                                          
Confusion Matrix: Test Set:

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0  945  195
         1  485 1259
                                          
               Accuracy : 0.7642          
                 95% CI : (0.7483, 0.7796)
    No Information Rate : 0.5042          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.5276          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.6608          
            Specificity : 0.8659          
         Pos Pred Value : 0.8289          
         Neg Pred Value : 0.7219          
             Prevalence : 0.4958          
         Detection Rate : 0.3277          
   Detection Prevalence : 0.3953          
      Balanced Accuracy : 0.7634          
                                          
       'Positive' Class : 0               
                                          

The same conclusion holds for the confusion matrices of the logit model. Accuracy is 76.46% on the train set and 76.42% on the test set, with sensitivity of 66.99% and 66.08% and specificity of 85.77% and 86.59%, respectively. The minimal differences in performance between the train and test sets suggest that there is no significant overfitting, as the model generalizes well to unseen data. As with the random forest, the model recovers Class 1 better than Class 0 (higher specificity than sensitivity), so there is room for improvement in detecting Class 0. To reduce this asymmetry, methods such as class weighting or adjusting the classification threshold may be beneficial.